Parsing Thai Social Data: A New Challenge for Thai NLP
Dependency parsing (DP) is a task that analyzes text for its syntactic structure
and the relationships between words. DP is widely used to improve natural language
processing (NLP) applications in many languages such as English. Previous work
on DP is generally applicable to formally written language. However, it does
not apply well to informal language such as that used on social networks.
Therefore, DP has to be researched and explored with such social network data.
In this paper, we explore and identify a DP model that is suitable for Thai
social network data. We then identify the appropriate linguistic
unit to use as input. The results show that the transition-based model called the
improved Elkared dependency parser outperforms the others, with a UAS of 81.42%.
Comment: 7 pages, 8 figures, to be published in The 14th International Joint
Symposium on Artificial Intelligence and Natural Language Processing
(iSAI-NLP 2019)
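The UAS figure quoted in the abstract is the unlabeled attachment score: the fraction of tokens whose predicted head matches the gold-standard head, ignoring the dependency label. The following is a minimal sketch of that calculation, not the paper's evaluation script. The function name and the list-of-head-indices representation (0 standing for the root) are assumptions made for illustration.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Return the fraction of tokens whose predicted head equals the gold head.

    Both arguments are lists of head indices, one per token, with 0 denoting
    the root. This representation is an illustrative assumption.
    """
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold_heads:
        return 0.0
    # Count tokens attached to the correct head, regardless of relation label.
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)


# Hypothetical four-token sentence: the parser attaches the last token wrongly.
gold = [2, 0, 2, 3]
predicted = [2, 0, 2, 2]
score = unlabeled_attachment_score(gold, predicted)
```

In practice the score is usually aggregated over all tokens of a test set rather than averaged per sentence.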
Constructing an academic Thai plagiarism corpus for benchmarking plagiarism detection systems
Plagiarism is a major problem in the academic world. It not only undermines the
credibility of educational institutions, but also disrupts the process of creating knowledge
in the academic community. To lessen this problem, many plagiarism detection systems have
been developed to detect plagiarized texts in academic works. In this paper, we describe the
design and process of creating an academic Thai plagiarism corpus. This corpus is necessary
for training and testing plagiarism detection systems for Thai. In order to make this corpus a
comprehensive representation of plagiarism, the data has been divided into various types
based on the degree of the linguistic mechanisms used in plagiarism. Data in our
corpus comes from two main sources: texts manually created by participants and texts automatically
generated by a program. After the corpus is created, its validity is verified using three
measurements: a measurement of similarity between suspicious texts at the character level, a
measurement of similarity between suspicious texts at the word level, and a comparison of
different types of data compiled in the corpus based on the similarity measured. The results of
the analyses indicate that the corpus created by the proposed methods is effective in training
and testing plagiarism detection systems.
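The abstract does not say which similarity measures were used, so the following is only a sketch of what character-level and word-level comparison of a suspicious text against its source might look like. It assumes Python's standard `difflib.SequenceMatcher` ratio as the measure and assumes the word-level input has already been segmented into tokens, since Thai has no spaces between words. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def char_similarity(text_a, text_b):
    """Similarity in [0, 1] computed over the raw character sequences."""
    return SequenceMatcher(None, text_a, text_b, autojunk=False).ratio()


def word_similarity(tokens_a, tokens_b):
    """Similarity in [0, 1] computed over pre-segmented word tokens.

    Assumes the caller has already run a word segmenter; no specific
    segmenter is implied here.
    """
    return SequenceMatcher(None, tokens_a, tokens_b, autojunk=False).ratio()


# Hypothetical example: one of two words is replaced in the suspicious text.
source_tokens = ["alpha", "beta"]
suspicious_tokens = ["alpha", "gamma"]
word_score = word_similarity(source_tokens, suspicious_tokens)
char_score = char_similarity("alphabeta", "alphabeta")
```

Comparing how these scores are distributed across the corpus's plagiarism types would correspond to the third verification step the abstract describes.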